GPUs are ubiquitous in modern computers. The table below compares the GPUs found in today's typical computer systems.
| NVIDIA GPUs | Tesla K80 | GTX 1080 | GT 650M |
|---|---|---|---|
| Computers | servers, clusters | desktop | laptop |
| Main usage | scientific computing | daily work, gaming | daily work |
| Memory | 24 GB | 8 GB | 1 GB |
| Memory bandwidth | 480 GB/sec | 320 GB/sec | 80 GB/sec |
| Number of cores | 4992 | 2560 | 384 |
| Processor clock | 562 MHz | 1.6 GHz | 0.9 GHz |
| Peak DP performance | 2.91 TFLOPS | 257 GFLOPS | |
| Peak SP performance | 8.73 TFLOPS | 8228 GFLOPS | 691 GFLOPS |
GPU architecture vs CPU architecture.
* GPUs contain 100s of processing cores on a single card; several cards can fit in a desktop PC
* Each core carries out the same operations in parallel on different input data – single program, multiple data (SPMD) paradigm
* Extremely high arithmetic intensity if one can transfer the data onto and results off of the processors quickly
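The SPMD paradigm maps naturally onto Julia's broadcasting syntax: a single scalar kernel is applied to every element of the input arrays. The snippet below is a CPU-side sketch of the idea; the same broadcast expression compiles to a GPU kernel when given GPU arrays such as the `MtlArray`s used later. The kernel name `saxpy` is just an illustrative choice here.

```julia
# the "single program": one scalar kernel
saxpy(a, x, y) = a * x + y

x = rand(Float32, 1024)
y = rand(Float32, 1024)

# the "multiple data": broadcasting applies the same kernel
# to every pair of elements
z = saxpy.(2f0, x, y)
```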
2 GPGPU in Julia
GPU support by Julia is under active development. Check JuliaGPU for currently available packages.
There are multiple paradigms for programming GPUs in Julia, depending on the specific hardware.
CUDA is an ecosystem exclusively for Nvidia GPUs. There are extensive CUDA libraries for scientific computing: cuBLAS, cuRAND, cuSPARSE, cuSOLVER, cuDNN, …
The CUDA.jl package allows defining arrays on Nvidia GPUs and overloads many common operations.
The AMDGPU.jl package allows defining arrays on AMD GPUs and overloads many common operations.
The Metal.jl package allows defining arrays on Apple Silicon and overloads many common operations.
The oneAPI.jl package allows defining arrays on Intel GPUs and overloads many common operations.
I’ll illustrate using Metal.jl on my MacBook Pro running macOS Ventura 13.2.1. It has an Apple M2 Max chip with 38 GPU cores.
versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 1 on 8 virtual cores
Environment:
JULIA_EDITOR = code
Activating project at `~/Documents/github.com/ucla-biostat-257/2023spring/slides/09-juliagpu`
Status `~/Documents/github.com/ucla-biostat-257/2023spring/slides/09-juliagpu/Project.toml`
[6e4b80f9] BenchmarkTools v1.3.2
[dde4c033] Metal v0.3.0
[37e2e46d] LinearAlgebra
3 Query GPU devices in the system
using Metal

Metal.versioninfo()
macOS 13.3.0, Darwin 21.5.0
Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
1 device:
- Apple M2 Max (64.000 KiB allocated)
4 Transfer data between main memory and GPU
# generate data on CPU
x = rand(Float32, 3, 3)
# transfer data from CPU to GPU
xd = MtlArray(x)
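The snippet above only shows the CPU-to-GPU direction. A hedged sketch of the reverse transfer, collecting a device array back into main memory, uses the `Array` constructor (the standard idiom across the Julia GPU array packages):

```julia
using Metal

x  = rand(Float32, 3, 3)
xd = MtlArray(x)   # CPU -> GPU
x2 = Array(xd)     # GPU -> CPU (copies device data back to main memory)
```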
using BenchmarkTools, LinearAlgebra, Random

Random.seed!(257)
n = 1024
# on CPU
x = rand(Float32, n, n)
y = rand(Float32, n, n)
z = zeros(Float32, n, n)
# on GPU
xd = MtlArray(x)
yd = MtlArray(y)
zd = MtlArray(z)

# SP matrix multiplication on GPU
bm_gpu = @benchmark Metal.@sync mul!($zd, $xd, $yd)
BenchmarkTools.Trial: 9374 samples with 1 evaluation.
Range (min … max): 353.625 μs … 1.796 ms┊ GC (min … max): 0.00% … 0.00%
Time (median): 503.771 μs ┊ GC (median): 0.00%
Time (mean ± σ): 531.680 μs ± 109.258 μs┊ GC (mean ± σ): 0.00% ± 0.00%
▁▇ ▅▁▁▃▁ ▁█▇ ▃▆▄ ▄▄
▂▂▄▃▄▅███▇██████▇███▅▃▃▄▇███▇▇▆▃▂▂▁▁▂▃▅████▄▄▂▂▁▁▁▁▁▁▁▁▁▂▂▂▂▁ ▄
354 μs Histogram: frequency by time 825 μs <
Memory estimate: 800 bytes, allocs estimate: 40.
# SP throughput on GPU
(2n^3) / (minimum(bm_gpu.times) / 1e9)
6.072771008837045e12
# SP matrix multiplication on CPU
bm_cpu = @benchmark mul!($z, $x, $y)
BenchmarkTools.Trial: 1655 samples with 1 evaluation.
Range (min … max): 2.937 ms … 10.839 ms┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.004 ms ┊ GC (median): 0.00%
Time (mean ± σ): 3.021 ms ± 263.584 μs┊ GC (mean ± σ): 0.00% ± 0.00%
▁▄▆▆▇█▆▁
▂▂▂▂▂▂▂▂▂▃▃▃▃▃▃▅█████████▆▅▄▄▅▄▂▃▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▂▁▁▂▁▁▁▁▁▂▂ ▃
2.94 ms Histogram: frequency by time 3.12 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
# SP throughput on CPU
(2n^3) / (minimum(bm_cpu.times) / 1e9)
7.312449639908062e11
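Dividing the two measured throughputs above gives the GPU-over-CPU speedup directly:

```julia
# ratio of the SP throughputs reported above
speedup = 6.072771008837045e12 / 7.312449639908062e11  # ≈ 8.3
```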
We see roughly an 8x speedup by the GPU in this matrix multiplication example.
# Cholesky on Gram matrix
# This one doesn't seem to work on Apple M2 chip yet
# xtxd = xd'xd + I
# @benchmark Metal.@sync cholesky($(xtxd))
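Until a native Metal Cholesky is available, one possible workaround is a sketch under the assumption that the Gram-matrix computation itself runs on the device: form the Gram matrix on the GPU, copy it back to main memory, and factorize on the CPU with the stdlib `cholesky`.

```julia
using Metal, LinearAlgebra

n    = 1024
xd   = MtlArray(rand(Float32, n, n))
xtxd = xd'xd + I                 # Gram matrix on GPU (positive definite)
xtx  = Array(xtxd)               # copy back to CPU main memory
F    = cholesky(Symmetric(xtx))  # factorize on CPU
```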